CS4132 Data Analytics

Eve Online: The popularity of modules by Ryan Suwandi¶

Important Note: Please keep your report concise and relevant (i.e. show only relevant steps and visualizations used to answer your research questions).
Important Note 2: The notebook takes 15 minutes to run and may lag when all the outputs are displayed at once.¶

Table of Contents (with relevant hyperlinks to sections)¶

Motivation & Background

Summary of Research Questions and Results

Dataset

Data Acquisition

Data "Cleaning"

EDA

Results and Conclusion

Further Investigation

References

Motivation and Background¶

Give an overview of the project, motivation, background and goals.

Eve Online is a massively multiplayer online role-playing game (MMORPG) set in space. To leave a space station and explore, you need a spaceship. Each ship has its own attributes, one of which is its slots. Modules are fitted into these slots to increase the capabilities of the ship. Some types of modules are:

  • Weapons: used to deal damage to the enemy
  • Electronic warfare: used to debuff other ships
  • Logistics: used to buff other ships
  • Tanking: gives your ship more health/regeneration

Of course, there are many more categories, and each category has many different modules.

There is obviously some strategy as to which modules you would put on your ship. It is well known, for example, that you should not bring a pickaxe to a gunfight.

By using ship data from Eve, I can gain insight into the different strategies for fitting modules on ships, and also see how the general community actually fits its ships. Potentially, this could have a link to wider human psychology.

There is one main goal: to find out the most common situations in which given modules are used.

Summary of Research Questions & Results

Repeat your research questions in a numbered list. After each research question, clearly state the answer/conclusion you determined. Do not give details or justifications yet — just the answer

As the number of modules is very large, the analysis will group modules based on what they do.

  1. In a given region, are the modules more rarely used based on the availability of the modules in the region?

No correlation was found; however, some modules are rarely used.

  2. Given the price of a ship, are modules of a similar price more commonly used than on average?

Partly, but there are many outliers.

  3. Is the popularity of modules different when fighting player and non-player enemies?

Yes.

  4. Given a certain module, which modules (and how many) are most commonly fit alongside it?

Mostly modules that synergize, but other pairs exist.

Dataset

Numbered list of dataset (with downloadable links) and a brief but clear description of each dataset used. Draw reference to the numbering when describing methodology (data cleaning and analysis).
  1. https://data.everef.net/killmails

This data contains descriptions of about 70% of all the kills that have happened in Eve. These descriptions are called "killmails". A killmail is a snapshot of a ship, its pilot, and its surroundings at the moment the victim ship was destroyed. Because ships are destroyed fairly often while players explore, killmails cover most of the different situations in Eve. A killmail records not only the ship and the modules on it when it was destroyed, but also the attackers that killed the ship, where the ship was, and the pilot's alliance at the time.

  2. https://www.fuzzwork.co.uk/dump/latest/

This contains the IDs and attributes of things in the game. The files that will be used are listed below.

  3. https://www.fuzzwork.co.uk/dump/latest/invTypes.csv

This maps the item's ID to its name and details.

  4. https://www.fuzzwork.co.uk/dump/latest/invFlags.csv

This maps a flag to a slot in the ship, which is useful for finding out which items were fit on the ship and which items were carried in cargo.

  5. https://www.fuzzwork.co.uk/dump/latest/mapSolarSystems.csv

This maps the solar system's ID to the name of the solar system, its security status, and other attributes.

  6. https://www.fuzzwork.co.uk/dump/latest/mapRegions.csv

This maps the region ID to the name of the region and, in certain cases, the faction the region belongs to.

  7. https://www.fuzzwork.co.uk/dump/latest/chrFactions.csv

This maps the faction ID to the name of the faction, its race, home solar system and corporation ID.

  8. https://esi.evetech.net/latest/markets/regionid/history?type_id=itemid

This is the historical market data from 1 August 2021 onwards. Replace "regionid" with the region ID and "itemid" with the item ID to get results.

Methodology¶

You should demonstrate the data science life cycle here (from data acquisition to cleaning to EDA and analysis etc).

Data Acquisition¶

Display the data which will be used in the project. The data should be saved in .xlsx or .csv format to be submitted with the project. If webscraping has been done to obtain your data, save your webscraping code in another jupyter notebook as appendix to be submitted separately from the report. Import and display each dataset in a dataframe. For each dataset, give a brief overview of the data it contains, and explain the meaning of columns that are relevant to the project.
In [1]:
import pandas as pd
import json
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
import numpy as np
from scipy import stats
import plotly.express as px

Part 1: Non-static data¶

In [2]:
sample = pd.read_csv("cut_mails.csv")
sample
Out[2]:
Unnamed: 0 killmail_id attackers killmail_time solar_system_id victim.position.x victim.position.y victim.position.z victim.character_id victim.corporation_id victim.damage_taken victim.items victim.ship_type_id victim.alliance_id victim.faction_id moon_id war_id
0 134 563 [{'damage_done': 15706, 'faction_id': 500012, ... 2007-12-06T00:24:00Z 30001429 NaN NaN NaN 1.184150e+09 338591511 15706 [{'flag': 87, 'item_type_id': 2444, 'quantity_... 24700 NaN NaN NaN NaN
1 312 1489 [{'character_id': 1184117757, 'corporation_id'... 2007-12-06T02:13:00Z 30003286 NaN NaN NaN 4.093779e+08 773499566 436 [] 670 283331937.0 NaN NaN NaN
2 441 1977 [{'alliance_id': 833571739, 'character_id': 10... 2007-12-06T03:04:00Z 30002098 NaN NaN NaN 9.289588e+08 908128976 4399 [{'flag': 5, 'item_type_id': 220, 'quantity_dr... 16240 NaN NaN NaN NaN
3 590 2419 [{'character_id': 121912466, 'corporation_id':... 2007-12-06T03:52:00Z 30001984 NaN NaN NaN 8.209644e+08 1000167 386 [] 670 NaN NaN NaN NaN
4 730 2897 [{'alliance_id': 628991027, 'character_id': 14... 2007-12-06T04:44:00Z 30000865 NaN NaN NaN 1.929333e+09 1000166 490 [] 11134 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
477256 1263247 102609213 [{'alliance_id': 99009764, 'character_id': 937... 2022-08-07T09:07:31Z 30001334 -1.708461e+11 5.973561e+09 7.392005e+11 2.112485e+09 98659263 8029 [{'flag': 13, 'item_type_id': 4405, 'quantity_... 626 99009331.0 NaN NaN NaN
477257 1263382 102609488 [{'damage_done': 1187, 'faction_id': 500011, '... 2022-08-07T09:29:02Z 30023410 -6.046522e+11 -8.566997e+10 -1.734189e+12 1.720391e+09 98520878 1187 [] 3766 99001317.0 NaN NaN NaN
477258 1263522 102609796 [{'character_id': 1192357732, 'corporation_id'... 2022-08-07T09:50:43Z 31001667 -4.277093e+12 -1.819450e+11 1.195963e+12 9.651552e+07 98684884 40756 [{'flag': 5, 'item_type_id': 12068, 'quantity_... 17920 99009116.0 NaN NaN NaN
477259 1263699 102610181 [{'character_id': 96015918, 'corporation_id': ... 2022-08-07T10:20:40Z 30002092 -3.808458e+12 3.863937e+11 -8.717327e+12 2.120252e+09 1000179 451 [] 670 NaN 500003.0 NaN NaN
477260 1263853 102610550 [{'alliance_id': 99006371, 'character_id': 211... 2022-08-07T10:48:27Z 30003737 1.311266e+12 1.724271e+11 -2.764761e+12 2.114119e+09 98512148 12993 [{'flag': 5, 'item_type_id': 24490, 'quantity_... 29990 99001932.0 NaN NaN NaN

477261 rows × 17 columns

This is a sample of killmails taken from the API. It is approximately 1/150 of the total data. The important columns are:¶

  • killmail_id: The id of the killmails, in order of when the kill happened.
  • attackers: These are the ships that contributed to the death of the ship in the killmail. Used in question 3.
  • solar_system_id: The id of the solar system that the kill is in. Used for question 1.
  • victim.items: The items found on the victim when the kill happened.
  • victim.ship_type_id: The victim's ship that was destroyed.
In [3]:
attackers_sample = pd.json_normalize(json.loads(sample.loc[460228, "attackers"].replace("\'", "\"").replace("True", "1").replace("False", "0")))
attackers_sample
Out[3]:
alliance_id character_id corporation_id damage_done final_blow security_status ship_type_id weapon_type_id faction_id
0 99003581.0 2.114314e+09 98614214.0 3400 1 1.3 17720 2913.0 NaN
1 99003581.0 9.403009e+07 98535868.0 585 0 0.8 11999 11999.0 NaN
2 NaN NaN NaN 10 0 0.0 24150 NaN 500010.0
3 99003581.0 2.115479e+09 98598862.0 0 0 5.0 12013 37612.0 NaN
4 99003581.0 2.114577e+09 98702890.0 0 0 5.0 12003 3025.0 NaN
5 99003581.0 1.836479e+09 98535868.0 0 0 4.5 11961 2897.0 NaN
6 99003581.0 2.113735e+09 98535868.0 0 0 5.0 17718 3025.0 NaN
7 99003581.0 2.118955e+09 98538918.0 0 0 4.9 12023 2109.0 NaN

This is the "attackers" of a killmail, displayed as a sample. The important columns are:¶

  • character_id: The id of the player that attacked. If it is missing, the attacker is a non-player character.
  • faction_id: The faction of the attacking enemy, only appears on non-players
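
The nested columns are stored as stringified Python lists, and the cells in this section parse them by swapping quote characters before calling json.loads. That approach can fail if a value ever contains an apostrophe. One possible alternative, shown here only as a sketch on a made-up string (not a row from the dataset), is ast.literal_eval, which reads Python literal syntax directly. The helper name parse_literal_list is hypothetical, and booleans stay True/False rather than becoming 1/0.

```python
import ast
import pandas as pd

# Hypothetical stringified "attackers" value, in the same shape as the column.
raw = ("[{'character_id': 1, 'damage_done': 3400, 'final_blow': True}, "
       "{'damage_done': 10, 'faction_id': 500010, 'final_blow': False}]")

def parse_literal_list(text: str) -> pd.DataFrame:
    """Parse a Python-literal list of dicts into a flat DataFrame."""
    return pd.json_normalize(ast.literal_eval(text))

attackers_demo = parse_literal_list(raw)
print(attackers_demo)
```

An empty list ("[]") simply yields an empty dataframe, so rows with no items need no special handling.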
In [4]:
items_sample = pd.json_normalize(json.loads(sample.loc[460228, "victim.items"].replace("\'", "\"")))
items_sample
Out[4]:
flag item_type_id quantity_destroyed singleton quantity_dropped
0 14 33076 1.0 0 NaN
1 5 12559 3.0 0 NaN
2 5 12559 NaN 0 1.0
3 92 31484 1.0 0 NaN
4 19 5973 1.0 0 NaN
5 30 23071 NaN 0 1.0
6 15 47255 NaN 0 1.0
7 27 2993 NaN 0 1.0
8 11 2364 1.0 0 NaN
9 28 2993 NaN 0 1.0
10 5 23085 4.0 0 NaN
11 93 31484 1.0 0 NaN
12 29 23071 NaN 0 1.0
13 14 28668 NaN 0 2.0
14 20 5405 1.0 0 NaN
15 13 2605 NaN 0 1.0
16 28 23071 1.0 0 NaN
17 12 2364 1.0 0 NaN
18 5 28668 91.0 0 NaN
19 27 23071 1.0 0 NaN
20 30 2993 NaN 0 1.0
21 29 2993 NaN 0 1.0

This is a sample from "victim.items" of a killmail. The important columns are:¶

  • flag: The place/slot on a ship the item was found in when the ship was destroyed.
  • item_type_id: The id of the item
  • quantity_destroyed: The amount of the item that was destroyed (ignore if NaN)
  • quantity_dropped: The amount of the item that was dropped into the world in containers (ignore if NaN)

In [5]:
market = pd.read_csv("market.csv").iloc[:,1:].set_index("Unnamed: 0")
market
Out[5]:
2021-08-01 2021-08-02 2021-08-03 2021-08-05 2021-08-06 2021-08-07 2021-08-08 2021-08-09 2021-08-10 2021-08-11 ... 11/9/2022 12/9/2022 13/9/2022 14/9/2022 15/9/2022 16/9/2022 17/9/2022 18/9/2022 19/9/2022 20/9/2022
Unnamed: 0
18 37.85 27.61 36.00 36.04 36.17 36.21 36.24 39.9 36.33 36.40 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 25990.00 20010.00 12538.46 456.00 456.00 472.00 473.00 13990.0 3000.00 1839.25 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20 825.20 969.00 840.30 840.20 840.00 813.10 815.30 815.3 975.00 813.00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 574.70 574.80 575.00 575.00 576.00 576.10 576.40 576.6 576.60 576.80 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 1055.00 1057.00 1062.00 1060.00 2486.00 1067.00 1068.00 1068.0 1070.00 1101.00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11989 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 9.057182e+07 88218000.0 9.447667e+07 9.449231e+07 9.349214e+07 9.181571e+07 91390000.0 9.317692e+07 93298000.0 9.275667e+07
11993 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.321500e+08 134940000.0 1.342534e+08 1.336605e+08 1.327386e+08 1.295148e+08 132360000.0 1.378000e+08 140858620.7 1.453186e+08
11995 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.941762e+08 196285714.3 1.930211e+08 1.885273e+08 1.829000e+08 1.816880e+08 182280000.0 1.835571e+08 183306666.7 1.814350e+08
11999 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.441174e+08 144827777.8 1.409345e+08 1.381882e+08 1.334342e+08 1.313050e+08 132537735.9 1.323583e+08 131612500.0 1.317200e+08
12003 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.512000e+08 146761538.5 1.483000e+08 1.479143e+08 1.422059e+08 1.306059e+08 124690909.1 1.208824e+08 120463636.4 1.203267e+08

15023 rows × 832 columns

This is the average price of items from 1 August 2021 to 20 September 2022, the last date in the data.¶

  • The index is the item id
  • The columns are the dates, and the numbers are the average prices of the item with that item id on that date.
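
The column labels in the output above appear in two formats: ISO dates such as 2021-08-01 and day-first dates such as 11/9/2022. If the dates ever need to be sorted or compared, one option is to normalise the labels first. The sketch below uses a small made-up frame; parse_mixed_date is a hypothetical helper, and it assumes every slash-separated label is day/month/year.

```python
from datetime import datetime
import pandas as pd

def parse_mixed_date(label: str) -> pd.Timestamp:
    """Parse either 'YYYY-MM-DD' or 'D/M/YYYY' (day first) into a Timestamp."""
    fmt = "%d/%m/%Y" if "/" in label else "%Y-%m-%d"
    return pd.Timestamp(datetime.strptime(label, fmt))

# Made-up prices for two item ids, with both label styles present.
prices_demo = pd.DataFrame(
    {"11/9/2022": [40.10, 810.00], "2021-08-01": [37.85, 825.20]},
    index=[18, 20],
)
prices_demo.columns = [parse_mixed_date(c) for c in prices_demo.columns]
prices_demo = prices_demo.sort_index(axis=1)  # chronological column order
print(prices_demo.columns.tolist())
```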
In [6]:
volumes =  pd.read_csv("market_vol_final.csv").set_index("Unnamed: 0")
volumes
Out[6]:
24700 16240 11134 1944 24698 12034 672 583 638 627 ... 8335 31462 21320 14027 11217 509 23919 27673 18694 40696
Unnamed: 0
10000014 1.076923 1.185185 20.286082 1.285714 1.285714 2.029412 15.729323 2.076923 1.000000 1.176471 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000016 1.311475 8.812500 40.180905 1.914729 3.329700 1.837209 163.959799 13.613065 1.792857 2.459574 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000023 2.526480 5.149701 30.211587 1.531746 1.973799 3.550769 35.562814 4.263473 1.075472 1.870813 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000030 2.185185 17.695652 39.221106 2.118812 3.541311 1.935223 23.690955 6.854167 1.663755 4.304878 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000032 7.670854 131.648241 39.449749 3.502762 7.300254 2.654275 47.195980 12.392405 3.048571 14.979899 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000033 1.096774 16.200000 21.376884 1.963636 1.600000 1.000000 45.183417 4.910256 1.196262 1.935252 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000038 1.833333 1.840000 7.359195 1.121951 1.750000 1.000000 2.878788 2.181818 1.000000 1.833333 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000042 2.064057 58.664141 24.778894 2.529221 3.217143 2.347656 22.324121 7.000000 1.764706 3.675141 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000047 1.000000 1.622222 12.602532 1.051724 1.352941 1.083333 7.501511 1.895833 1.000000 1.400000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000048 1.148148 4.804487 59.920308 1.296000 1.377358 1.382979 198.141058 3.308017 1.111111 2.760000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000060 14.879397 13.229008 143.298995 3.274052 5.212291 7.836788 57.962312 6.147541 2.139373 7.805085 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000064 1.874046 47.233668 15.660804 3.392344 1.508287 1.214286 25.379397 4.899135 1.208333 5.414758 ... 8.169935 1.0 1372.555556 1.0 1.236842 1.0 NaN NaN NaN NaN
10000069 1.076923 2.938596 5.453172 1.140000 1.415094 1.095238 26.758794 2.325444 1.434783 2.240000 ... NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0

13 rows × 4090 columns


Part 2: Static data¶

In [7]:
inv = pd.read_csv("invTypes.csv").set_index("typeID")
inv
Out[7]:
groupID typeName description mass volume capacity portionSize raceID basePrice published marketGroupID iconID soundID graphicID
typeID
0 0 #System NaN 1.0 0.00 0.0 1 None None 0 None None None 0
2 2 Corporation NaN 0.0 0.00 0.0 1 None None 0 None None None 0
3 3 Region NaN 0.0 1.00 0.0 1 None None 0 None None None 0
4 4 Constellation NaN 0.0 1.00 0.0 1 None None 0 None None None 0
5 5 Solar System NaN 0.0 1.00 0.0 1 None None 0 None None None 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
369550 351064 'Loceros' Basic-H mk.0 A prototype MN Basic loadout with prototype we... 0.0 0.01 0.0 1 None 48000.0000 0 None None None 0
370308 368726 'Deathshroud' AM-M SKIN This SKIN only applies to Medium Amarr dropsui... 0.0 0.01 0.0 1 4 3000.0000 0 None None None 0
370488 368726 ‘Tairei’s Crimson’ AM-L SKIN This SKIN only applies to Light Amarr dropsuit... 0.0 0.01 0.0 1 4 3000.0000 0 None None None 0
370658 351844 Council's Modified Repair Tool By projecting a focused harmonic beam into dam... 0.0 0.01 0.0 1 None 1125.0000 0 None None None 0
371027 350858 X-MS16 Snowball Launcher The Mass Driver is a semi-automatic, multi-sho... 0.0 0.01 0.0 1 4 47220.0000 0 None None None 0

43050 rows × 14 columns

This is information on the items in the game. Important columns are:¶

  • typeID: The item id
  • typeName: The item name
In [8]:
groups = pd.read_csv("invGroups.csv")
groups
Out[8]:
groupID categoryID groupName iconID useBasePrice anchored anchorable fittableNonSingleton published
0 0 0 #System None 0 0 0 0 0
1 1 1 Character None 0 0 0 0 0
2 2 1 Corporation None 0 0 0 0 0
3 3 2 Region None 0 0 0 0 0
4 4 2 Constellation None 0 0 0 0 0
... ... ... ... ... ... ... ... ... ...
1451 367774 350001 Salvage Containers None 0 0 0 0 0
1452 367776 350001 Salvage Decryptors None 0 0 0 0 0
1453 368656 350001 Battle Salvage None 1 0 0 0 0
1454 368666 350001 Warbarge None 1 0 0 0 0
1455 368726 350001 Infantry Color Skin None 1 0 0 0 0

1456 rows × 9 columns

This maps the group ID to the group, which is the type of item.¶

  • groupID: the group ID
  • groupName: the name of the group
In [9]:
marketgroups = pd.read_csv("invMarketGroups.csv").set_index("marketGroupID")
marketgroups
Out[9]:
parentGroupID marketGroupName description iconID hasTypes
marketGroupID
2 None Blueprints & Reactions Blueprints are data items used in industry for... 2703 0
4 None Ships Capsuleer spaceships of all sizes and roles, i... 1443 0
5 1361 Standard Frigates Small, fast vessels suited to a variety of pur... 1443 0
6 1367 Standard Cruisers The middle children of the starship industry, ... 1443 0
7 1376 Standard Battleships The foundations of any respectable fighting fo... 1443 0
... ... ... ... ... ...
2815 9 Compressors NaN 25152 1
2816 209 Compressor Blueprints NaN 2703 1
2819 1612 Special Edition Electronic Attack Frigates Electronic Attack Frigates which have been off... 1443 1
2820 11 Structure Area Denial Ammunition Area denial ammunition, fired by structure def... 1004 1
2821 211 Structure Area Denial Ammunition Blueprints of area denial ammunition. 2703 1

1932 rows × 5 columns

In [10]:
flags = pd.read_csv("invFlags.csv")
flags
Out[10]:
flagID flagName flagText orderID
0 0 None None 0
1 1 Wallet Wallet 10
2 2 Offices OfficeFolder 0
3 3 Wardrobe Wardrobe 0
4 4 Hangar Hangar 30
... ... ... ... ...
131 178 Raffles Raffles Hangar 0
132 179 FrigateEscapeBay Frigate escape bay Hangar 0
133 180 StructureDeedBay Structure Deed Bay 0
134 181 SpecializedIceHold Specialized Ice Hold 0
135 182 SpecializedAsteroidHold Specialized Asteroid Hold 0

136 rows × 4 columns

These map the flag ID to the slot represented by that flag. This can be used to find which slots are relevant to the project.¶

  • flagID: the flag ID
  • flagName: The name of the slot
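
As a rough illustration of how the flags can separate fitted modules from cargo, the sketch below selects the flag IDs whose names look like module slots. The six rows are an illustrative subset typed in by hand, and the naming pattern (HiSlotN, MedSlotN, LoSlotN, RigSlotN) is an assumption that should be checked against invFlags.csv before use.

```python
import pandas as pd

# Illustrative subset of flag rows; the real file has 136 rows.
flags_demo = pd.DataFrame({
    "flagID": [5, 11, 19, 27, 87, 92],
    "flagName": ["Cargo", "LoSlot0", "MedSlot0", "HiSlot0", "DroneBay", "RigSlot0"],
})

# Assumption: fitted module slots are named HiSlotN, MedSlotN, LoSlotN or RigSlotN.
is_fitted = flags_demo["flagName"].str.match(r"^(?:Hi|Med|Lo|Rig)Slot\d+$")
fitted_flag_ids = set(flags_demo.loc[is_fitted, "flagID"])
print(sorted(fitted_flag_ids))
```

Items whose flag is not in this set (cargo, drone bay and so on) would then be excluded when counting fitted modules.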
In [11]:
solarsystems = pd.read_csv("mapSolarSystems.csv")
solarsystems
Out[11]:
regionID constellationID solarSystemID solarSystemName x y z xMin xMax yMin ... corridor hub international regional constellation security factionID radius sunTypeID securityClass
0 10000001 20000001 30000001 Tanoo -8.851079e+16 4.236944e+16 -4.451353e+16 -8.851190e+16 -8.850926e+16 4.236930e+16 ... 0 1 1 1 None 0.858324 500007 1.323338e+12 45041 B
1 10000001 20000001 30000002 Lashesih -1.033010e+17 4.170750e+16 -2.985630e+16 -1.033016e+17 -1.032995e+17 4.170747e+16 ... 1 0 1 1 None 0.751689 500007 1.018400e+12 45037 B
2 10000001 20000001 30000003 Akpivem -9.117414e+16 4.393823e+16 -5.648282e+16 -9.117829e+16 -9.117334e+16 4.393819e+16 ... 0 1 0 0 None 0.846292 500007 2.473362e+12 3799 B
3 10000001 20000001 30000004 Jark -9.367593e+16 5.060424e+16 -2.840353e+16 -9.367738e+16 -9.367549e+16 5.060420e+16 ... 1 0 1 1 None 0.817001 500007 1.771412e+12 45030 B
4 10000001 20000001 30000005 Sasta -9.478216e+16 4.312625e+16 -3.189671e+16 -9.478287e+16 -9.477774e+16 4.312619e+16 ... 0 1 0 0 None 0.814337 500007 2.563946e+12 45040 B
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8480 14000005 24000025 34000196 V-196 -3.742976e+18 2.252058e+18 -6.137890e+18 -3.742991e+18 -3.742961e+18 2.252043e+18 ... 0 0 0 0 None -0.990000 None 1.495979e+13 None None
8481 14000005 24000025 34000197 V-197 -3.762613e+18 2.317148e+18 -6.127626e+18 -3.762628e+18 -3.762598e+18 2.317133e+18 ... 0 0 0 0 None -0.990000 None 1.495979e+13 None None
8482 14000005 24000025 34000198 V-198 -3.726805e+18 2.273820e+18 -6.118384e+18 -3.726820e+18 -3.726790e+18 2.273805e+18 ... 0 0 0 0 None -0.990000 None 1.495979e+13 None None
8483 14000005 24000025 34000199 V-199 -3.702467e+18 2.271227e+18 -6.075477e+18 -3.702482e+18 -3.702452e+18 2.271212e+18 ... 0 0 0 0 None -0.990000 None 1.495979e+13 None None
8484 14000005 24000025 34000200 V-200 -3.726768e+18 2.248087e+18 -6.097488e+18 -3.726783e+18 -3.726753e+18 2.248072e+18 ... 0 0 0 0 None -0.990000 None 1.495979e+13 None None

8485 rows × 26 columns

This is a list of all the solar systems and information about them.¶

  • regionID: the ID of the region the solar system is in
  • constellationID: the ID of the constellation (subregion) the solar system is in
  • solarSystemID: the ID of the solar system
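
Since each killmail only records a solar_system_id, a region has to be looked up through this table. A minimal sketch of one way to do that is given below, using made-up rows; the notebook's actual join lives in the appendix and may differ.

```python
import pandas as pd

# Made-up rows in the shape of the killmail sample and mapSolarSystems.csv.
kills_demo = pd.DataFrame({
    "killmail_id": [1, 2, 3],
    "solar_system_id": [30000001, 30000002, 39999999],  # last id is deliberately unknown
})
systems_demo = pd.DataFrame({
    "solarSystemID": [30000001, 30000002],
    "regionID": [10000001, 10000001],
})

# Build an id -> region lookup and attach it; unknown systems become NaN.
system_to_region = systems_demo.set_index("solarSystemID")["regionID"]
kills_demo["regionID"] = kills_demo["solar_system_id"].map(system_to_region)
print(kills_demo)
```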
In [12]:
regi = pd.read_csv("mapRegions.csv").set_index("regionID")
regi
Out[12]:
regionName x y z xMin xMax yMin yMax zMin zMax factionID nebula radius
regionID
10000001 Derelik -7.736195e+16 5.087803e+16 -6.443310e+16 -1.055500e+17 -4.917392e+16 2.712855e+16 7.462751e+16 2.642336e+16 1.024428e+17 500007 11799 None
10000002 The Forge -9.642033e+16 6.402708e+16 1.125398e+17 -1.436457e+17 -4.919500e+16 3.515456e+16 9.289960e+16 -1.444526e+17 -8.062703e+16 500001 11806 None
10000003 Vale of the Silent -4.406932e+16 9.472944e+16 1.813847e+17 -9.923376e+16 1.109511e+16 5.820417e+16 1.312547e+17 -2.188796e+17 -1.438898e+17 None 11814 None
10000004 UUA-F4 8.986800e+16 5.478010e+16 2.725758e+17 6.739083e+16 1.123452e+17 1.386504e+16 9.569515e+16 -3.807742e+17 -1.643773e+17 None 11817 None
10000005 Detorid 1.335404e+17 -3.139150e+16 -1.963923e+17 5.808592e+16 2.089949e+17 -5.072033e+16 -1.206267e+16 1.647489e+17 2.280357e+17 None 11849 None
... ... ... ... ... ... ... ... ... ... ... ... ... ...
14000001 VR-01 -3.900972e+18 2.574945e+18 -8.266928e+18 -4.050972e+18 -3.750972e+18 2.424945e+18 2.724945e+18 -8.416928e+18 -8.116928e+18 None 11821 None
14000002 VR-02 -3.731107e+18 3.112926e+18 -8.155502e+18 -3.881107e+18 -3.581107e+18 2.962926e+18 3.262926e+18 -8.305502e+18 -8.005502e+18 None 11821 None
14000003 VR-03 -5.431842e+18 2.985429e+18 -6.018316e+18 -5.581842e+18 -5.281842e+18 2.835429e+18 3.135429e+18 -6.168316e+18 -5.868316e+18 None 11821 None
14000004 VR-04 -4.545299e+18 2.308091e+18 -6.316707e+18 -4.695299e+18 -4.395299e+18 2.158091e+18 2.458091e+18 -6.466707e+18 -6.166707e+18 None 11821 None
14000005 VR-05 -3.876324e+18 2.174764e+18 -5.975813e+18 -4.026324e+18 -3.726324e+18 2.024764e+18 2.324764e+18 -6.125813e+18 -5.825813e+18 None 11821 None

112 rows × 13 columns

A list of the regions in the star map and information about them¶

  • regionID: the ID of the region
  • regionName: the region's name
  • factionID: the faction the region belongs to. If it is missing (None), the region is not owned by a single faction.
In [13]:
factions = pd.read_csv("chrFactions.csv")
factions
Out[13]:
factionID factionName description raceIDs solarSystemID corporationID sizeFactor stationCount stationSystemCount militiaCorporationID iconID
0 500001 Caldari State The Caldari State is ruled by several mega-cor... 1 30000145 1000035 5.0 None None 1000180 1439
1 500002 Minmatar Republic The Minmatar Republic was formed over a centur... 2 30002544 1000051 5.0 None None 1000182 1440
2 500003 Amarr Empire The largest of the five main empires, the Amar... 4 30002187 1000084 5.0 None None 1000179 1442
3 500004 Gallente Federation The Gallente Federation encompasses several ra... 8 30004993 1000120 5.0 None None 1000181 1441
4 500005 Jove Empire The Jove Empire is isolated from the rest of t... 16 30001642 1000149 5.0 None None None 2195
5 500006 CONCORD Assembly CONCORD is an independent organization founded... 1 30005204 1000137 5.0 None None None 1434
6 500007 Ammatar Mandate The Ammatars are part of the Amarr Empire, but... 2 30000001 1000123 4.0 None None None 10172
7 500008 Khanid Kingdom The Khanid Kingdom, also known as the Dark Ama... 4 30003863 1000156 4.0 None None None 10173
8 500009 The Syndicate Formed by Intaki exiles from the Gallente Fede... 8 30003271 1000146 4.0 None None None 1437
9 500010 Guristas Pirates Formed by two former members of the Caldari Na... 1 30001290 1000127 4.0 None None None 1630
10 500011 Angel Cartel Operating from the heart of the Curse region, ... 1 30001045 1000138 4.0 None None None 10174
11 500012 Blood Raider Covenant The Amarr Empire has had its share of religiou... 4 30003088 1000134 3.0 None None None 1441
12 500013 The InterBus The InterBus is one of the more successful joi... 1 30005203 1000148 3.0 None None None 96
13 500014 ORE Outer Ring Excavations, or ORE, is the largest... 8 30004504 1000129 3.0 None None None 1720
14 500015 Thukker Tribe The Thukker tribe is one of the seven original... 2 30000905 1000163 3.0 None None None 10175
15 500016 Servant Sisters of EVE The Sisters of EVE are mainly known for their ... 1 30001978 1000130 3.0 None None None 1004
16 500017 The Society of Conscious Thought The Society of Conscious Thought is three cent... 16 30002423 1000131 3.0 None None None 10176
17 500018 Mordu's Legion Command The origin of Mordu's Legion lies in the Galle... 1 30002005 1000128 3.0 None None None 1722
18 500019 Sansha's Nation Sansha's Nation was founded more than a centur... 1 30001868 1000162 4.0 None None None 10177
19 500020 Serpentis The Serpentis Corporation was founded a few de... 1 30004623 1000135 4.0 None None None 10178
20 500021 Unknown Unknown 1 30005286 None 0.0 None None None 0
21 500024 Drifters Emerging from the ruins of the Sleeper civiliz... 16 30005286 1000274 0.0 None None None 21404
22 500025 Rogue Drones While rogues drones come in all shapes, sizes ... 134 30005286 1000287 0.0 None None None 20996
23 500026 Triglavian Collective The Triglavian Collective appears to be a huma... 135 30005286 1000298 5.0 None None None 20996
24 500027 EDENCOM EDENCOM is the New Eden Common Defense Initiat... 1 30005204 1000297 5.0 None None None 24419
25 500028 Association for Interdisciplinary Research The Association for Interdisciplinary Research... 4 30005305 1000413 5.0 None None None 21

This lists the factions in Eve.¶

  • factionID: the ID of the faction
  • factionName: the name of the faction

Data Cleaning

For data cleaning, be clear in which dataset (or variables) are used, what has been done for missing data, how was merging performed, explanation of data transformation (if any). If data is calculated or summarized from the raw dataset, explain the rationale and steps clearly.

Part 1: The killmails¶

Dataset used: https://data.everef.net/killmails

When the killmails are downloaded, they are in text files with JSON objects scattered in them. The objects are extracted and placed into dataframes, and the columns are standardized for easy formatting. Whenever the dataframes become too large to fit in memory (~5 GB), they are saved to CSV files and removed from memory.
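
The extraction code itself is in the appendix. Purely as an illustration of the save-and-release pattern described above, the sketch below parses JSON objects in small batches and appends each batch to a CSV file. It assumes one JSON object per line, which may not match the downloaded files, and the names COLUMNS, flush_chunk and extract are hypothetical.

```python
import json
import os
import tempfile
import pandas as pd

# Subset of standardized columns, for illustration only.
COLUMNS = ["killmail_id", "killmail_time", "solar_system_id",
           "attackers", "victim.ship_type_id", "victim.items"]

def flush_chunk(records, path):
    """Flatten a batch of killmail dicts and append it to a CSV file."""
    frame = pd.json_normalize(records, max_level=1).reindex(columns=COLUMNS)
    frame.to_csv(path, mode="a", header=not os.path.exists(path), index=False)

def extract(lines, path, chunk_size=2):
    """Parse one JSON object per line, writing every `chunk_size` records to disk."""
    batch = []
    for line in lines:
        batch.append(json.loads(line))
        if len(batch) >= chunk_size:
            flush_chunk(batch, path)
            batch = []  # release the batch once it is on disk
    if batch:
        flush_chunk(batch, path)

# Made-up killmails standing in for the downloaded files.
demo_lines = [json.dumps({"killmail_id": i, "killmail_time": "2022-08-07T09:07:31Z",
                          "solar_system_id": 30001334, "attackers": [],
                          "victim": {"ship_type_id": 670, "items": []}})
              for i in range(5)]
out_path = os.path.join(tempfile.mkdtemp(), "killmails_demo.csv")
extract(demo_lines, out_path)
extracted = pd.read_csv(out_path)
print(extracted.shape)
```

Reindexing every batch to the same column list is what keeps the appended CSV rows aligned, even when a killmail lacks an optional field.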

The CSV files are then combined into one, and a systematic sample (1 row in every ~150 rows) is drawn from it. This is where sample comes from.
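
For reference, a systematic sample with a fixed step can be sketched as below. This is a simplification: the intervals in the real sample are only approximately 150, and systematic_sample is a hypothetical helper rather than the appendix code.

```python
import pandas as pd

def systematic_sample(frame: pd.DataFrame, step: int, start: int = 0) -> pd.DataFrame:
    """Keep every `step`-th row, beginning at position `start`."""
    return frame.iloc[start::step]

# Stand-in for the combined CSV files.
combined_demo = pd.DataFrame({"killmail_id": range(1000)})
sample_demo = systematic_sample(combined_demo, step=150)
print(len(sample_demo))
```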

Part 2: The market data¶

The market data comes from the website in JSON format. For each item, the average price and the date are taken from every daily entry and stored as one row whose index is the item id. Dates with no data are turned into NaN.
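
A minimal sketch of this reshaping is shown below. The payload is made up, and it assumes each daily entry is a dict containing at least "date" and "average" fields; history_to_row is a hypothetical helper, and no request is sent to the website.

```python
import pandas as pd

# Made-up payload, one dict per day (assumed shape of the history response).
history_demo = [
    {"date": "2021-08-01", "average": 37.85, "volume": 100},
    {"date": "2021-08-03", "average": 36.00, "volume": 80},
]

def history_to_row(item_id: int, history: list, field: str = "average") -> pd.Series:
    """Turn one item's daily history into a Series indexed by date."""
    values = {day["date"]: day[field] for day in history}
    return pd.Series(values, name=item_id, dtype="float64")

rows = [
    history_to_row(18, history_demo),
    history_to_row(19, [{"date": "2021-08-02", "average": 20010.0, "volume": 5}]),
]
# Stacking the rows aligns them on date; dates an item has no data for become NaN.
market_demo = pd.DataFrame(rows).sort_index(axis=1)
print(market_demo)
```

Passing field="volume" to the same helper would produce the volume rows instead of the price rows.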

Part 3: The market volume data¶

This time the volume is taken instead of the average price, and the mean over all the days is taken so that days with NaN data are filtered out.
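
The averaging step can be illustrated on a made-up table of daily volumes for a single region. The layout of the result (region id as the row, item ids as the columns) mirrors the volumes dataframe shown earlier; the numbers themselves are invented.

```python
import pandas as pd

# Made-up daily volumes for one region: rows are item ids, columns are dates.
daily_volume_demo = pd.DataFrame(
    {"2021-08-01": [10.0, None], "2021-08-02": [None, 4.0], "2021-08-03": [30.0, 2.0]},
    index=[24700, 16240],
)

# mean(axis=1) skips NaN by default, so each item is reduced to one number.
mean_volume = daily_volume_demo.mean(axis=1)
# One row per region (region id as the index), one column per item id.
volumes_demo = mean_volume.to_frame(name=10000014).T
print(volumes_demo)
```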

Part 4: The pricing¶

A function extracts the JSON list of items from the victim.items column. The items are filtered based on their 'flag', and their prices are then looked up in the market data. If a price is NaN, the average price is used instead. The same is done for the ship. The information is then stored in a dataframe.
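
The sketch below shows one way these steps could fit together; it is not the appendix code. It assumes that "average price" means the item's mean price over all dates, that FITTED_FLAGS holds the flag ids of fitted slots, and that the items string can be read with ast.literal_eval. All names and numbers are hypothetical.

```python
import ast
import pandas as pd

# Made-up price table: item ids as the index, dates as the columns.
market_demo = pd.DataFrame(
    {"2022-08-06": [100.0, 5.0], "2022-08-07": [None, 7.0]},
    index=[2993, 23071],
)
FITTED_FLAGS = {11, 12, 13, 27, 28}  # assumption: a few low- and high-slot flag ids

def price_on(item_id: int, date: str) -> float:
    """Price on `date`, falling back to the item's mean price if that day is NaN."""
    if item_id not in market_demo.index:
        return float("nan")
    price = market_demo.at[item_id, date] if date in market_demo.columns else float("nan")
    return price if pd.notna(price) else market_demo.loc[item_id].mean()

def fitted_value(items_text: str, date: str) -> float:
    """Total price of the items whose flag marks a fitted slot."""
    items = ast.literal_eval(items_text)
    fitted = [it for it in items if it["flag"] in FITTED_FLAGS]
    return sum(price_on(it["item_type_id"], date) for it in fitted)

items_demo = ("[{'flag': 27, 'item_type_id': 2993}, "
              "{'flag': 5, 'item_type_id': 23071}, "
              "{'flag': 28, 'item_type_id': 23071}]")
print(fitted_value(items_demo, "2022-08-07"))
```

The ship's own price could be obtained the same way by calling price_on with the victim.ship_type_id and the kill date.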

Part 5: The pvp/non-pvp information¶

As with the pricing data, the attackers and items are extracted from their JSON. The items are then saved into a dataframe according to the kind of attackers (player or non-player) on the killmail.
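
As described in the data acquisition section, an attacker without a character_id is a non-player character. A simple classification rule built on that is sketched below. Treating a kill as player-versus-player whenever at least one attacker is a player is an assumption made for this example; the rule actually used is in the appendix.

```python
def is_pvp(attackers: list) -> bool:
    """True if at least one attacker is a player (has a character_id)."""
    return any(a.get("character_id") is not None for a in attackers)

# Made-up attacker lists in the shape of the "attackers" column.
player_kill = [{"character_id": 2114314000, "damage_done": 3400},
               {"faction_id": 500010, "damage_done": 10}]
npc_kill = [{"faction_id": 500011, "damage_done": 1187}]

print(is_pvp(player_kill), is_pvp(npc_kill))
```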

For more information on the code, check the appendix. In the interest of time, none of the code will be placed here.

EDA

For each research questions shortlisted, outline your methodology in answering them. Discuss interesting observations or results discovered. Please note to only show EDA that's relevant to answering the question at hand. If you have done any data modeling, include in this section.

Note: data processing that takes too long will be skipped, and the finished files imported instead.¶

Q1: In a given region, are the modules more rarely used based on the availability of the modules in the region?¶

As the counts are whole numbers, modules and regions with very few records would contribute many 0s and bias the dataset. So, we limit the data to modules with over 500 total occurrences, and regions which have more than 100,000 modules recorded.

In the interest of time, calculation of the number of items found in each region throughout the killmail data has already been done and stored in region_items.csv.

In [14]:
regions = pd.read_csv("region_items.csv")
regions = regions.groupby("Region").sum()
regions = regions.T[regions.sum()>500].T[regions.sum(axis=1)>100000]
regions2 = regions.copy()
regions = regions / regions.sum()
print(regions.index.values.tolist())
for ind in regions.index.values:
    plt.figure(figsize=(20,8))
    bot15 = regions.loc[ind].sort_values(ascending=True)[0:]
    x = pd.to_numeric(bot15.index.values).tolist()
    plot = px.bar(x = inv.loc[x, "typeName"], y=bot15.values)
    plot.update_layout(title_text = (regi.loc[ind, "regionName"]), yaxis_title = "Relative Frequency", xaxis_title = "Module")
    plot.show()
regions
[10000002, 10000014, 10000016, 10000023, 10000030, 10000032, 10000033, 10000038, 10000042, 10000047, 10000048, 10000060, 10000064, 10000069]
Out[14]:
12773 31179 2913 3841 519 8089 1999 27387 5975 3244 ... 33474 33816 34317 34562 34828 35683 42685 12198 47466 49710
Region
10000002 0.080018 0.115866 0.074108 0.086198 0.188332 0.074222 0.083310 0.039119 0.095040 0.054379 ... 0.255172 0.036458 0.027426 0.054492 0.055901 0.050754 0.156951 0.000000 0.135659 0.041152
10000014 0.126062 0.107516 0.113890 0.162075 0.088424 0.152851 0.136867 0.180389 0.127495 0.112199 ... 0.068966 0.067708 0.132911 0.142857 0.114907 0.091907 0.031390 0.267442 0.046512 0.045267
10000016 0.032633 0.043841 0.041828 0.051538 0.035217 0.040808 0.033075 0.027332 0.037844 0.026594 ... 0.113793 0.026042 0.025316 0.030928 0.045031 0.028807 0.067265 0.000000 0.100775 0.061728
10000023 0.055878 0.070981 0.054103 0.100914 0.053969 0.089436 0.062137 0.139563 0.060747 0.061127 ... 0.055172 0.093750 0.086498 0.123711 0.093168 0.057613 0.040359 0.322674 0.069767 0.082305
10000030 0.077783 0.051148 0.063878 0.036056 0.059965 0.046922 0.060753 0.034677 0.049479 0.058150 ... 0.051724 0.046875 0.050633 0.042710 0.018634 0.042524 0.049327 0.000000 0.027132 0.028807
10000032 0.026375 0.039666 0.030689 0.034173 0.025765 0.041945 0.032937 0.010079 0.037232 0.025073 ... 0.031034 0.031250 0.021097 0.020619 0.017081 0.045267 0.098655 0.000000 0.096899 0.028807
10000033 0.060796 0.080376 0.063424 0.060813 0.094573 0.076923 0.067672 0.068671 0.078138 0.088714 ... 0.062069 0.140625 0.086498 0.079529 0.085404 0.085048 0.139013 0.000000 0.089147 0.131687
10000038 0.080018 0.057411 0.086383 0.035079 0.063319 0.045073 0.068364 0.037752 0.065891 0.079518 ... 0.044828 0.062500 0.105485 0.070692 0.034161 0.028807 0.053812 0.000000 0.034884 0.069959
10000042 0.073759 0.037578 0.089111 0.043308 0.074042 0.051614 0.065735 0.029723 0.057440 0.058018 ... 0.058621 0.062500 0.048523 0.033873 0.029503 0.057613 0.067265 0.000000 0.081395 0.074074
10000047 0.094323 0.092902 0.089793 0.092963 0.055900 0.065264 0.083310 0.063375 0.088059 0.072374 ... 0.041379 0.041667 0.105485 0.091311 0.080745 0.071331 0.022422 0.200581 0.050388 0.045267
10000048 0.087170 0.061587 0.070925 0.052793 0.048379 0.047775 0.060338 0.047831 0.071647 0.074226 ... 0.062069 0.065104 0.061181 0.038292 0.068323 0.069959 0.085202 0.000000 0.046512 0.069959
10000060 0.083594 0.093946 0.082519 0.132854 0.068554 0.112043 0.096457 0.200205 0.070545 0.077335 ... 0.068966 0.057292 0.061181 0.076583 0.243789 0.167353 0.026906 0.209302 0.143411 0.152263
10000064 0.024139 0.048017 0.048193 0.033336 0.036030 0.035831 0.043039 0.021182 0.053644 0.051535 ... 0.027586 0.057292 0.037975 0.038292 0.021739 0.057613 0.058296 0.000000 0.038760 0.053498
10000069 0.097452 0.099165 0.091157 0.077899 0.107531 0.119295 0.106006 0.100102 0.106797 0.160757 ... 0.058621 0.210938 0.149789 0.156112 0.091615 0.145405 0.103139 0.000000 0.038760 0.115226

14 rows × 1045 columns

<Figure size 1440x576 with 0 Axes>
In [15]:
regions2 = regions2.loc[volumes.index.values]
regions2 = regions2 / regions2.sum()
regions2
Out[15]:
12773 31179 2913 3841 519 8089 1999 27387 5975 3244 ... 33474 33816 34317 34562 34828 35683 42685 12198 47466 49710
Region
10000014 0.137026 0.121606 0.123005 0.177364 0.108941 0.165105 0.149306 0.187733 0.140885 0.118651 ... 0.092593 0.070270 0.136659 0.151090 0.121711 0.096821 0.037234 0.267442 0.053812 0.047210
10000016 0.035471 0.049587 0.045176 0.056399 0.043388 0.044079 0.036081 0.028444 0.041819 0.028124 ... 0.152778 0.027027 0.026030 0.032710 0.047697 0.030347 0.079787 0.000000 0.116592 0.064378
10000023 0.060739 0.080283 0.058434 0.110433 0.066491 0.096606 0.067784 0.145244 0.067127 0.064643 ... 0.074074 0.097297 0.088937 0.130841 0.098684 0.060694 0.047872 0.322674 0.080717 0.085837
10000030 0.084548 0.057851 0.068991 0.039457 0.073879 0.050683 0.066274 0.036089 0.054676 0.061494 ... 0.069444 0.048649 0.052061 0.045171 0.019737 0.044798 0.058511 0.000000 0.031390 0.030043
10000032 0.028669 0.044864 0.033145 0.037396 0.031743 0.045308 0.035930 0.010489 0.041142 0.026515 ... 0.041667 0.032432 0.021692 0.021807 0.018092 0.047688 0.117021 0.000000 0.112108 0.030043
10000033 0.066084 0.090909 0.068500 0.066550 0.116516 0.083090 0.073822 0.071467 0.086345 0.093816 ... 0.083333 0.145946 0.088937 0.084112 0.090461 0.089595 0.164894 0.000000 0.103139 0.137339
10000038 0.086978 0.064935 0.093297 0.038388 0.078012 0.048687 0.074577 0.039289 0.072811 0.084091 ... 0.060185 0.064865 0.108460 0.074766 0.036184 0.030347 0.063830 0.000000 0.040359 0.072961
10000042 0.080175 0.042503 0.096244 0.047394 0.091222 0.055752 0.071709 0.030933 0.063473 0.061354 ... 0.078704 0.064865 0.049892 0.035826 0.031250 0.060694 0.079787 0.000000 0.094170 0.077253
10000047 0.102527 0.105077 0.096980 0.101732 0.068871 0.070496 0.090882 0.065956 0.097307 0.076536 ... 0.055556 0.043243 0.108460 0.096573 0.085526 0.075145 0.026596 0.200581 0.058296 0.047210
10000048 0.094752 0.069658 0.076602 0.057773 0.059604 0.051605 0.065821 0.049778 0.079172 0.078494 ... 0.083333 0.067568 0.062907 0.040498 0.072368 0.073699 0.101064 0.000000 0.053812 0.072961
10000060 0.090865 0.106257 0.089123 0.145387 0.084460 0.121026 0.105223 0.208356 0.077954 0.081783 ... 0.092593 0.059459 0.062907 0.080997 0.258224 0.176301 0.031915 0.209302 0.165919 0.158798
10000064 0.026239 0.054309 0.052050 0.036480 0.044390 0.038704 0.046950 0.022044 0.059277 0.054498 ... 0.037037 0.059459 0.039046 0.040498 0.023026 0.060694 0.069149 0.000000 0.044843 0.055794
10000069 0.105928 0.112161 0.098453 0.085248 0.132482 0.128859 0.115640 0.104178 0.118013 0.170001 ... 0.078704 0.218919 0.154013 0.165109 0.097039 0.153179 0.122340 0.000000 0.044843 0.120172

13 rows × 1045 columns

In [16]:
volumes = volumes / volumes.sum()
volumes
Out[16]:
24700 16240 11134 1944 24698 12034 672 583 638 627 ... 8335 31462 21320 14027 11217 509 23919 27673 18694 40696
Unnamed: 0
10000014 0.027097 0.003811 0.044119 0.049218 0.036878 0.070059 0.023397 0.028899 0.051453 0.022688 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000016 0.032998 0.028334 0.087388 0.073298 0.095506 0.063424 0.243890 0.189417 0.092248 0.047432 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000023 0.063569 0.016557 0.065706 0.058637 0.056614 0.122579 0.052900 0.059324 0.055336 0.036078 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000030 0.054982 0.056895 0.085300 0.081110 0.101575 0.066807 0.035240 0.095372 0.085605 0.083017 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000032 0.193009 0.423274 0.085798 0.134089 0.209393 0.091630 0.070204 0.172433 0.156858 0.288879 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000033 0.027596 0.052086 0.046492 0.075170 0.045893 0.034522 0.067210 0.068323 0.061551 0.037320 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000038 0.046129 0.005916 0.016005 0.042949 0.050195 0.034522 0.004282 0.030359 0.051453 0.035355 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000042 0.051934 0.188616 0.053891 0.096821 0.092277 0.081045 0.033207 0.097401 0.090799 0.070873 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000047 0.025161 0.005216 0.027409 0.040261 0.038806 0.037399 0.011159 0.026379 0.051453 0.026998 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000048 0.028889 0.015447 0.130318 0.049612 0.039507 0.047743 0.294735 0.046029 0.057170 0.053225 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000060 0.374385 0.042534 0.311655 0.125334 0.149504 0.270540 0.086219 0.085539 0.110077 0.150517 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10000064 0.047153 0.151865 0.034060 0.129862 0.043262 0.041919 0.037752 0.068168 0.062172 0.104421 ... 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
10000069 0.027097 0.009448 0.011860 0.043640 0.040589 0.037810 0.039804 0.032357 0.073824 0.043197 ... NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0

13 rows × 4090 columns

In [17]:
for ind in regions2.index.values:
    plt.figure(figsize=(10,4))
    # The 25 modules with the lowest relative frequency in this region
    bot25 = regions2.loc[ind].sort_values(ascending=True)[0:25]
    x2 = bot25.index.values.tolist()
    x = pd.to_numeric(bot25.index.values).tolist()
    plot = sns.barplot(x = inv.loc[x, "typeName"], y=bot25.values, alpha=0.5, color='#FF0000')
    volums = []
    if (ind != 10000002):
        volum = volumes.loc[ind]
        for i in x2:
            try:
                volums.append(volum[i])
            except KeyError:
                volums.append(np.nan)
        sns.barplot(x = inv.loc[x, "typeName"], y=volums, alpha=0.5, color='#0000FF')
    plt.title(regi.loc[ind, "regionName"])
    plt.xlabel("25 least used modules (relatively)")
    plt.ylabel("Relative Frequency")
    plot.set_xticklabels(plot.get_xticklabels(), rotation=30, ha="right")
    plt.show()
regions

In [18]:
for ind in regions2.index.values:
    plt.figure(figsize=(15,6))
    
    # With an ascending sort, the last 45 entries are the most used modules
    top45 = regions2.loc[ind].sort_values(ascending=True)[-45:]
    x2 = top45.index.values.tolist()
    x = pd.to_numeric(top45.index.values).tolist()
    plot = sns.barplot(x = inv.loc[x, "typeName"], y=top45.values, alpha=0.5, color='#FF0000')
    volums = []
    if (ind != 10000002):
        volum = volumes.loc[ind]
        for i in x2:
            try:
                volums.append(volum[i])
            except KeyError:
                volums.append(np.nan)
        sns.barplot(x = inv.loc[x, "typeName"], y=volums, alpha=0.5, color='#0000FF')
    plt.title(regi.loc[ind, "regionName"])
    plt.xlabel("45 most used modules (relatively)")
    plt.ylabel("Relative Frequency")
    plot.set_xticklabels(plot.get_xticklabels(), rotation=30, ha="right")
    plt.show()
regions

In [19]:
for ind in regions2.index.values:
    plt.figure(figsize=(15,6))
    
    # All modules in this region, ordered from least to most used
    ordered = regions2.loc[ind].sort_values(ascending=True)
    x2 = ordered.index.values.tolist()
    x = pd.to_numeric(ordered.index.values).tolist()
    plot = sns.barplot(x = inv.loc[x, "typeName"], y=ordered.values, alpha=0.5, color='#FF0000')
    volums = []
    if (ind != 10000002):
        volum = volumes.loc[ind]
        for i in x2:
            try:
                volums.append(volum[i])
            except KeyError:
                volums.append(np.nan)
        sns.barplot(x = inv.loc[x, "typeName"], y=volums, alpha=0.5, color='#0000FF')
    plt.title(regi.loc[ind, "regionName"])
    plt.xlabel("All modules, sorted by relative frequency")
    plt.ylabel("Relative Frequency")
    plot.set_xticklabels(plot.get_xticklabels(), rotation=30, ha="right")
    plt.show()
regions

Q2: Given the price of a ship, are modules of a similar price more commonly used than on average?

In the interest of time, calculation of the ship price and the price of items on the ship has already been done for each ship in the killmail dataset, and stored in shipprice.csv.

In [20]:
shipprice = pd.read_csv("shipprice.csv")
shipprice = shipprice.dropna()
shipprice
Out[20]:
Unnamed: 0.1 Unnamed: 0 ship_id ship_price module_price
1 1 0 47269 2.558519e+07 3.350072e+07
3 3 0 24700 5.292508e+07 1.426485e+07
5 5 0 35832 7.095789e+08 8.227371e+08
7 7 0 596 1.857917e+04 1.103100e+02
8 8 0 17636 6.217667e+08 3.960204e+08
... ... ... ... ... ...
3482989 925756 0 28659 1.052254e+09 1.066701e+09
3482990 925757 0 17720 2.626386e+08 2.122895e+07
3482991 925758 0 606 4.834000e+02 1.110000e+02
3482996 925763 0 33474 1.789000e+06 0.000000e+00
3482997 925764 0 33475 7.175000e+06 0.000000e+00

1934633 rows × 5 columns

Plotly is not used here as it freezes the notebook.

In [21]:
sns.scatterplot(x=shipprice["ship_price"], y=shipprice["module_price"])
Out[21]:
<AxesSubplot: xlabel='ship_price', ylabel='module_price'>

Almost nothing can be read from this plot: a few extreme outliers stretch the axes, so nearly all points are squeezed into the bottom left corner. Ship prices in Eve span many orders of magnitude, from a few hundred (see the table above) to apparently tens of billions of currency, so it is a good idea to take the logarithm of both prices.

The logarithm of 0 is undefined, so we first remove all rows with a module price of 0. This also suits the analysis: a module price of 0 suggests that no modules were on the ship, and including such ships would skew the data.
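As a minimal sketch of the step described above, the toy table below drops zero-priced rows and then takes the natural logarithm. The prices are made up for illustration and are not rows from shipprice.csv; the column names `log_ship` and `log_module` are also assumptions.

```python
import numpy as np
import pandas as pd

# Toy prices for illustration only; these are not rows from shipprice.csv.
toy = pd.DataFrame({
    "ship_price":   [5.0e2, 2.5e7, 7.0e8, 1.8e6],
    "module_price": [1.0e2, 3.0e7, 8.0e8, 0.0],
})

# log(0) is undefined (NumPy returns -inf), so rows with no module value go first.
toy = toy[toy["module_price"] > 0].copy()

# Natural logarithm of both prices; every remaining value is finite.
toy["log_ship"] = np.log(toy["ship_price"])
toy["log_module"] = np.log(toy["module_price"])
print(toy)
```

The `.copy()` after filtering is a design choice: it makes the filtered table independent of the original, so adding columns later does not raise a SettingWithCopyWarning.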

In [22]:
shipprice = shipprice[shipprice["module_price"] > 0]
sns.scatterplot(x=np.log(shipprice["ship_price"]), y=np.log(shipprice["module_price"]))
Out[22]:
<AxesSubplot: xlabel='ship_price', ylabel='module_price'>

While the logarithm solves the problem of the outliers being too far out, it is still impossible to judge the correlation accurately, because many of the roughly 1.6 million remaining points are hidden behind other points. So we make the graph bigger and the points smaller, which lets us see the density of the points.
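Another way to make density visible, which this notebook does not use, is to count points per grid cell instead of drawing each point. The sketch below does this with `np.histogram2d` on synthetic data; the arrays `log_ship` and `log_module` are random stand-ins, not the real columns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic log-prices standing in for the real columns (assumption for illustration).
log_ship = rng.normal(17.0, 2.0, size=100_000)
log_module = log_ship + rng.normal(0.0, 1.0, size=100_000)

# Count points per cell on a 50x50 grid; overlapping points now add up
# instead of hiding behind each other.
counts, ship_edges, module_edges = np.histogram2d(log_ship, log_module, bins=50)
print(counts.shape, int(counts.sum()))
```

The resulting grid could then be drawn with, for example, `plt.pcolormesh(ship_edges, module_edges, counts.T)`, which keeps the figure small regardless of the number of points.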

In [23]:
plt.figure(figsize=(50,40))
sns.scatterplot(x=np.log(shipprice["ship_price"]), y=np.log(shipprice["module_price"]),s =1)
Out[23]:
<AxesSubplot: xlabel='ship_price', ylabel='module_price'>

With this, we can see that in the middle and on the right side of the graph, an increase in ship price is met with an increase in module price. On the left side of the graph, however, the points appear to be scattered approximately at random.

The graph still has a great many outliers. We can color the points by ship type to check whether certain ship types deviate from this trend.

In [24]:
shipprice = shipprice.copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
ship_groups = pd.to_numeric(inv["marketGroupID"][shipprice["ship_id"]])  # ship -> market group
parent_groups = pd.to_numeric(marketgroups["parentGroupID"][ship_groups].tolist())  # market group -> parent group
shipprice["group"] = marketgroups["marketGroupName"][parent_groups].tolist()  # parent group -> name
shipprice
Out[24]:
Unnamed: 0.1 Unnamed: 0 ship_id ship_price module_price group
1 1 0 47269 2.558519e+07 3.350072e+07 Precursor Frigates
3 3 0 24700 5.292508e+07 1.426485e+07 Standard Battlecruisers
5 5 0 35832 7.095789e+08 8.227371e+08 Citadels
7 7 0 596 1.857917e+04 1.103100e+02 Corvettes
8 8 0 17636 6.217667e+08 3.960204e+08 Faction Battleships
... ... ... ... ... ... ...
3482985 925752 0 17480 3.899279e+07 5.674360e+06 Mining Barges
3482987 925754 0 16242 1.098000e+06 4.493609e+06 Standard Destroyers
3482989 925756 0 28659 1.052254e+09 1.066701e+09 Marauders
3482990 925757 0 17720 2.626386e+08 2.122895e+07 Faction Cruisers
3482991 925758 0 606 4.834000e+02 1.110000e+02 Corvettes

1569024 rows × 6 columns

In [25]:
plt.figure(figsize=(50,40))
sns.scatterplot(x=np.log(shipprice["ship_price"]), y=np.log(shipprice["module_price"]), s=5, hue=shipprice["group"])
Out[25]:
<AxesSubplot: xlabel='ship_price', ylabel='module_price'>

We see that most ship categories approximately follow this trend. However, there are too many categories to analyze individually, so we group the categories further and keep only the basic combat ships, which comprise most of the data.

In [26]:
fshipprice = shipprice.copy()
fshipprice["group"] = fshipprice["group"].str.split()
# Keep only groups whose first word is one of the basic ship classes
fshipprice = fshipprice[fshipprice["group"].str[0].isin(["Precursor", "Faction", "Advanced", "Standard", "Freighters"])]
def simplify_group(x):
    # Collapse every freighter group into "Freighters"; otherwise keep only the first word
    words = x["group"]
    if "Freighters" in words:
        x["group"] = "Freighters"
    else:
        x["group"] = words[0]
    return x
fshipprice = fshipprice.apply(simplify_group, axis=1)
fshipprice
Out[26]:
Unnamed: 0.1 Unnamed: 0 ship_id ship_price module_price group
1 1 0 47269 2.558519e+07 3.350072e+07 Precursor
3 3 0 24700 5.292508e+07 1.426485e+07 Standard
8 8 0 17636 6.217667e+08 3.960204e+08 Faction
13 13 0 602 4.966000e+05 1.563368e+07 Standard
14 14 0 626 1.041000e+07 7.912130e+06 Standard
... ... ... ... ... ... ...
3482975 925742 0 24698 4.879308e+07 1.457094e+07 Standard
3482983 925750 0 589 6.000000e+05 4.806029e+05 Standard
3482984 925751 0 4310 6.622831e+07 5.104411e+07 Standard
3482987 925754 0 16242 1.098000e+06 4.493609e+06 Standard
3482990 925757 0 17720 2.626386e+08 2.122895e+07 Faction

1066229 rows × 6 columns

In [27]:
plt.figure(figsize=(50,40))
sns.scatterplot(x=np.log(fshipprice["ship_price"]), y=np.log(fshipprice["module_price"]), hue=fshipprice["group"],s =1)
Out[27]:
<AxesSubplot: xlabel='ship_price', ylabel='module_price'>

In [28]:
pearson_coef, p_value = stats.pearsonr(np.log(shipprice["ship_price"]), np.log(shipprice["module_price"]))
pearson_coef, p_value
Out[28]:
(0.8872665778833148, 0.0)

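Based on the Pearson coefficient (about 0.89) and the p-value (reported as 0.0, i.e. too small to be represented as a floating-point number), we are confident that there is a strong correlation between the logarithm of the price of the ship and the logarithm of the price of the modules on it. However, because the relationship holds on a logarithmic scale, an error in the predicted logarithm becomes a multiplicative error once the logarithm is undone, so predicted prices can be off by a factor of a few.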
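To make the last point concrete, the sketch below converts an error in natural-log units into the factor by which the actual price would be off. The helper `error_factor` and the error sizes are hypothetical and chosen only for illustration; they are not measured from the data.

```python
import math

def error_factor(log_error: float) -> float:
    """Factor by which a price is off when its natural log is off by log_error."""
    return math.exp(abs(log_error))

# Hypothetical prediction errors measured in natural-log units.
for log_error in (0.5, 1.0, 1.5):
    print(f"log error {log_error:.1f} -> off by a factor of about {error_factor(log_error):.2f}")
```

An error of 1.0 in log space, which looks small on the log-log scatter plot, already corresponds to a price that is wrong by a factor of roughly 2.7.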

Q3: Is the popularity of modules different when fighting player and non-player enemies?

In the interest of time, the number of times each module was recorded in PvP or PvE contexts has already been calculated and stored in pnp.csv.

In [29]:
pnp = pd.read_csv("pnp.csv").set_index("Unnamed: 0")
pnp
Out[29]:
P N PN
Unnamed: 0
24700 1211.0 217.0 579.0
670 109636.0 3186.0 1006.0
16240 2634.0 5740.0 2397.0
11134 1563.0 341.0 36.0
1944 1012.0 78.0 22.0
... ... ... ...
62628 1.0 0.0 0.0
20985 0.0 0.0 8.0
61213 2.0 0.0 0.0
31332 0.0 0.0 1.0
62632 1.0 0.0 0.0

4961 rows × 3 columns

As in question 1, we filter out modules with very few occurrences. The threshold of 100 is applied to the row total across all three columns. We then drop the column PN, as it is not used here.

In [30]:
pnp = pnp[pnp.sum(axis=1) > 100][["P","N"]]
pnp
Out[30]:
P N
Unnamed: 0
24700 1211.0 217.0
670 109636.0 3186.0
16240 2634.0 5740.0
11134 1563.0 341.0
1944 1012.0 78.0
... ... ...
53343 94.0 44.0
54291 495.0 147.0
58919 177.0 36.0
58972 136.0 36.0
58966 77.0 17.0

1998 rows × 2 columns

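As we are only interested in comparing the relative proportions of modules used in these contexts, we normalize in two steps. First, each column is divided by its total, so that P and N are comparable even if the dataset contains different numbers of PvP and PvE records. Then each row is divided by its sum, so that P and N add up to 1 for every module.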

In [31]:
pnp = pnp / pnp.sum()
pnp = (pnp.T / pnp.sum(axis=1)).T
pnp = pnp.sort_values(by="P", ascending=True)
pnp
Out[31]:
P N
Unnamed: 0
2003 0.008755 0.991245
7937 0.027439 0.972561
31502 0.052395 0.947605
20795 0.052818 0.947182
31312 0.058529 0.941471
... ... ...
21926 1.000000 0.000000
20060 1.000000 0.000000
37532 1.000000 0.000000
28351 1.000000 0.000000
20064 1.000000 0.000000

1998 rows × 2 columns
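The effect of the two-step division is easier to see on a small table. The sketch below uses made-up counts in which PvP records are ten times as numerous as PvE records; the module names and numbers are assumptions for illustration only.

```python
import pandas as pd

# Made-up counts: 1000 PvP records in total, but only 100 PvE records.
toy = pd.DataFrame(
    {"P": [900.0, 90.0, 10.0], "N": [10.0, 40.0, 50.0]},
    index=["module_a", "module_b", "module_c"],
)

# Step 1: divide each column by its total, so each value is a share of its own context.
shares = toy / toy.sum()
# Step 2: divide each row by its sum, so P + N = 1 for every module.
balanced = shares.div(shares.sum(axis=1), axis=0)
print(balanced)
```

Here `module_b` has more PvP than PvE records in absolute terms (90 against 40), but it accounts for 9% of all PvP records and 40% of all PvE records, so after both steps its P value is about 0.18. Skipping the first step would instead give 90 / 130, or about 0.69, simply because PvP records dominate the toy data.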

Now we plot the proportion of each module found in PvP contexts.

In [32]:
plt.figure(figsize=(200,100))
sns.barplot(x = inv["typeName"][pnp.index.values], y=pnp["P"])
Out[32]:
<AxesSubplot: xlabel='typeName', ylabel='P'>
In [33]:
plt.figure(figsize=(20,10))

sns.kdeplot(data=pnp["P"])
Out[33]:
<AxesSubplot: xlabel='P', ylabel='Density'>

From the bar plot, we can deduce that approximately as many modules are used more in PvP contexts as are used more in PvE contexts. Because the modules are sorted by P, the roughly linear shape of the plot means that the distribution of P is approximately uniform. However, the plot curves at both ends, so comparatively few modules are used almost exclusively against player or non-player enemies.
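The link between a straight line and a uniform distribution can be checked on synthetic data: if values are uniformly distributed between 0 and 1, then plotting them in sorted order gives an approximately straight line from 0 to 1. The sketch below is only an illustration with random numbers, not the P column itself.

```python
import random

random.seed(0)
n = 2000
# Uniformly distributed proportions between 0 and 1 (synthetic stand-in for P).
values = sorted(random.random() for _ in range(n))

# For a uniform distribution, the i-th sorted value should be close to i / (n - 1),
# i.e. the sorted values should lie close to a straight line.
max_gap = max(abs(v - i / (n - 1)) for i, v in enumerate(values))
print(f"largest deviation from a straight line: {max_gap:.3f}")
```

A distribution concentrated around the middle would instead produce an S-shaped curve that is flat in the center and steep at the ends.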

Zooming in on the top right side of the graph:

In [34]:
plt.figure(figsize = (20,10))
plot = sns.barplot(x = inv["typeName"][pnp[pnp["P"] == 1].index.values], y=pnp[pnp["P"] == 1]["P"])
plot.set_xticklabels(plot.get_xticklabels(), rotation=30, ha="right")
plt.show()

Zooming in on the bottom left side of the graph:

In [35]:
plt.figure(figsize = (20,10))
plot = sns.barplot(x = inv["typeName"][pnp[pnp["P"] <= 0.15].index.values], y=pnp[pnp["P"] <= 0.15]["P"])
plot.set_xticklabels(plot.get_xticklabels(), rotation=30, ha="right")
plt.show()

Upon further investigation, the items found only in PvP situations are mostly structures and items fitted on structures. This makes sense: non-player enemies do not go around the solar system attacking and destroying structures, but players do.

Q4: Given a certain module, which modules are most commonly fit alongside it?

In the interest of time, the number of occurrences of each pair of modules (duplicates not counted) has already been calculated and stored in pairing.csv.

In [36]:
pairs = pd.read_csv("pairing.csv").set_index("0")
pairs
Out[36]:
178 179 180 181 182 183 184 185 186 187 ... 62590 62591 62622 62625 62628 62631 62632 62636 63140 63165
0
178 0.0 12.0 9.0 10.0 16.0 7.0 11.0 30.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
179 12.0 0.0 6.0 10.0 12.0 9.0 11.0 19.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
180 9.0 6.0 0.0 11.0 13.0 8.0 10.0 15.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
181 10.0 10.0 11.0 0.0 11.0 5.0 11.0 17.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
182 16.0 12.0 13.0 11.0 0.0 12.0 16.0 34.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
62631 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
62632 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
62636 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
63140 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
63165 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

4162 rows × 4162 columns
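The code that produced pairing.csv is not shown here. As a rough sketch of how such pair counts could be built, the example below counts each pair of distinct modules at most once per ship. The `fits` list is hypothetical, and reading "duplicates not counted" as "repeated modules on one ship count once" is an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical fits: each inner list holds the module type IDs found on one ship.
fits = [
    [178, 178, 182, 185],
    [178, 185],
    [182, 185, 185],
]

pair_counts = Counter()
for fit in fits:
    # set() drops repeated modules, so each pair is counted at most once per ship.
    for a, b in combinations(sorted(set(fit)), 2):
        pair_counts[(a, b)] += 1

print(pair_counts)
```

Sorting each fit before forming pairs keeps every pair in a single canonical order, so (178, 185) and (185, 178) are never counted separately. A symmetric matrix like the one above could then be filled by writing each count to both positions.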

Now we could try to plot a heatmap to see which pairs are most commonly seen. I have tried this, and with more than 4000 modules it is impossible to see anything. Check "heatmap.png" to see what it looks like; it cannot be displayed here without lagging the notebook.

So instead of plotting every single item against every other item, we will plot the item groups.

Once again, in the interest of time, the number of occurrences of each pair of module groups has already been calculated and stored in region_pairing.csv.

In [37]:
regionpairs = pd.read_csv("region_pairing.csv").set_index('group')
regionpairs
Out[37]:
102 103 105 106 107 108 109 112 113 116 ... 2738 2740 2742 2743 2744 2783 2795 2804 2805 2815
group
102 1344.0 31.0 9.0 0.0 140.0 15.0 0.0 4.0 71.0 3.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
103 31.0 228.0 10.0 0.0 7.0 41.0 0.0 18.0 3.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
105 9.0 10.0 124.0 3.0 0.0 0.0 2.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
106 0.0 0.0 3.0 20.0 5.0 10.0 5.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
107 140.0 7.0 0.0 5.0 1078.0 74.0 0.0 6.0 130.0 12.0 ... 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2783 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 4.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2795 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2804 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0 0.0 0.0
2805 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0
2815 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0

400 rows × 400 columns

We want relatively accurate data, so module groups with too few occurrences (a column total of 200 or less) will be excluded.

In [38]:
regionpairs = regionpairs.loc[:, regionpairs.sum() > 200]
regionpairs = regionpairs.loc[pd.to_numeric(regionpairs.columns.values.tolist())]
regionpairs
Out[38]:
102 103 105 106 107 108 109 112 113 116 ... 2467 2468 2469 2470 2471 2509 2529 2783 2804 2805
group
102 1344.0 31.0 9.0 0.0 140.0 15.0 0.0 4.0 71.0 3.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
103 31.0 228.0 10.0 0.0 7.0 41.0 0.0 18.0 3.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
105 9.0 10.0 124.0 3.0 0.0 0.0 2.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
106 0.0 0.0 3.0 20.0 5.0 10.0 5.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
107 140.0 7.0 0.0 5.0 1078.0 74.0 0.0 6.0 130.0 12.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2509 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2529 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2783 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 4.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2804 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0 0.0
2805 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0

350 rows × 350 columns
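The same filtering pattern can be tried on a small symmetric matrix. The counts below are made up, and only the threshold of 200 is taken from the cell above.

```python
import pandas as pd

# Made-up symmetric co-occurrence counts for three module groups.
toy = pd.DataFrame(
    [[300.0, 40.0, 1.0],
     [40.0, 250.0, 0.0],
     [1.0, 0.0, 2.0]],
    index=[102, 103, 2815],
    columns=[102, 103, 2815],
)

# Keep the columns whose total exceeds the threshold ...
kept = toy.loc[:, toy.sum() > 200]
# ... then keep the matching rows, so the matrix stays square.
kept = kept.loc[kept.columns]
print(kept)
```

In the toy table both the index and the column labels are integers, so no conversion is needed. In the notebook the column labels come from the CSV header as strings, which is why they are passed through `pd.to_numeric` before being used to select rows.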

Now we can plot the heatmap.

In [45]:
normalized = (regionpairs / regionpairs.sum()).T
In [46]:
plt.figure(figsize=(75,70))
sns.heatmap(np.log1p(np.log1p(normalized)), cmap='turbo')  # log1p(x) = log(x + 1)
plt.show()

The values in each row add up to 1, so the brightest points in a row are the module groups most commonly fit alongside that row's group. The columns are not normalized in the same way, so brighter columns correspond to module groups that appear much more often in the dataset.
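A small example can confirm the row sums and show what the double logarithm does. The matrix below is made up, and the group labels A, B and C are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up symmetric co-occurrence counts for three module groups.
counts = pd.DataFrame(
    [[300.0, 40.0, 10.0],
     [40.0, 250.0, 5.0],
     [10.0, 5.0, 20.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

# Dividing by the column sums and transposing makes every row of the result sum to 1.
normalized = (counts / counts.sum()).T
# log1p(x) equals log(1 + x); applying it twice compresses large values,
# which keeps small proportions visible on the color scale.
scaled = np.log1p(np.log1p(normalized))
print(normalized.sum(axis=1))
```

Because the count matrix is symmetric, this gives the same result as dividing each row by its own row sum. The double `log1p` only changes the color scale; it does not change which cell in a row is the brightest.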

In [47]:
grouppair2 = pd.read_csv("region_pairing2.csv").set_index("ngroup")
grouppair2
Out[47]:
9 10 11 14 52 114 115 117 118 120 ... 2227 2297 2340 2432 2463 2464 2527 2729 2730 2741
ngroup
9 56188.0 0.0 21344.0 24177.0 42742.0 9.0 232.0 3899.0 383.0 3377.0 ... 0.0 1012.0 0.0 31.0 15.0 16.0 5.0 0.0 0.0 0.0
10 0.0 0.0 3409.0 346.0 3028.0 0.0 1.0 5.0 6.0 71.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 21344.0 3409.0 35472.0 49863.0 78048.0 574.0 145.0 10330.0 4802.0 4230.0 ... 4.0 2003.0 3.0 1176.0 475.0 701.0 27.0 1.0 0.0 1.0
14 24177.0 346.0 49863.0 800.0 111828.0 393.0 150.0 10255.0 7564.0 8634.0 ... 0.0 3208.0 0.0 1237.0 486.0 740.0 25.0 5.0 2.0 7.0
52 42742.0 3028.0 78048.0 111828.0 16710.0 988.0 353.0 22386.0 12954.0 23317.0 ... 0.0 4907.0 0.0 1684.0 644.0 1029.0 35.0 7.0 12.0 19.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2464 16.0 0.0 701.0 740.0 1029.0 0.0 0.0 0.0 0.0 38.0 ... 0.0 144.0 0.0 995.0 0.0 0.0 0.0 0.0 0.0 0.0
2527 5.0 0.0 27.0 25.0 35.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2729 0.0 0.0 1.0 5.0 7.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0
2730 0.0 0.0 0.0 2.0 12.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.0
2741 0.0 0.0 1.0 7.0 19.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 12.0 0.0

98 rows × 98 columns

In [48]:
grouppair2 = grouppair2.loc[:, grouppair2.sum() > 200]
grouppair2 = grouppair2.loc[pd.to_numeric(grouppair2.columns.values.tolist())]
grouppair2
Out[48]:
9 10 11 14 52 114 115 117 118 120 ... 2208 2209 2226 2227 2297 2340 2432 2463 2464 2527
ngroup
9 56188.0 0.0 21344.0 24177.0 42742.0 9.0 232.0 3899.0 383.0 3377.0 ... 0.0 0.0 0.0 0.0 1012.0 0.0 31.0 15.0 16.0 5.0
10 0.0 0.0 3409.0 346.0 3028.0 0.0 1.0 5.0 6.0 71.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 21344.0 3409.0 35472.0 49863.0 78048.0 574.0 145.0 10330.0 4802.0 4230.0 ... 6.0 0.0 5.0 4.0 2003.0 3.0 1176.0 475.0 701.0 27.0
14 24177.0 346.0 49863.0 800.0 111828.0 393.0 150.0 10255.0 7564.0 8634.0 ... 0.0 0.0 0.0 0.0 3208.0 0.0 1237.0 486.0 740.0 25.0
52 42742.0 3028.0 78048.0 111828.0 16710.0 988.0 353.0 22386.0 12954.0 23317.0 ... 0.0 0.0 0.0 0.0 4907.0 0.0 1684.0 644.0 1029.0 35.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2340 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 93.0 19.0 76.0 52.0 0.0 188.0 0.0 0.0 0.0 0.0
2432 31.0 0.0 1176.0 1237.0 1684.0 0.0 0.0 0.0 0.0 62.0 ... 0.0 0.0 0.0 0.0 229.0 0.0 0.0 627.0 995.0 0.0
2463 15.0 0.0 475.0 486.0 644.0 0.0 0.0 0.0 0.0 24.0 ... 0.0 0.0 0.0 0.0 85.0 0.0 627.0 0.0 0.0 0.0
2464 16.0 0.0 701.0 740.0 1029.0 0.0 0.0 0.0 0.0 38.0 ... 0.0 0.0 0.0 0.0 144.0 0.0 995.0 0.0 0.0 0.0
2527 5.0 0.0 27.0 25.0 35.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

94 rows × 94 columns

In [49]:
normalized = (grouppair2 / grouppair2.sum()).T
normalized.columns = marketgroups["marketGroupName"][pd.to_numeric(normalized.columns.values).tolist()]
normalized.index = marketgroups["marketGroupName"][pd.to_numeric(normalized.index.values).tolist()]
normalized.index.name = "row"
normalized.columns.name='column'
fig = px.imshow(np.log(np.log(normalized+1)+1), text_auto = True, width=600, height=500)
fig.update_layout().update_yaxes(automargin=False).update_xaxes(automargin=False)
fig.show()
#plt.show()

Results Findings & Conclusion

For each research question, summarize in 2-3 visualizations which will answer the question. Interpret the results accordingly and give your observation and conclusion. The visualizations should be well presented (apply what you have learnt in Chapter 9 on data communication). The plots shown here could be an enhanced version of the EDA plots, or presented in another format.

Q1: In a given region, are the modules more rarely used based on the availability of the modules in the region?

In [50]:
for ind in regions.index.values:
    # Sort this region's module frequencies from least to most used.
    bot15 = regions.loc[ind].sort_values(ascending=True)
    x = pd.to_numeric(bot15.index.values).tolist()
    # Plotly draws its own figure, so no empty matplotlib figure is created here.
    plot = px.bar(x = inv.loc[x, "typeName"], y=bot15.values)
    plot.update_layout(title_text = (regi.loc[ind, "regionName"]), yaxis_title = "Relative Frequency", xaxis_title = "Module")
    plot.show()

The graphs above show, for each region, the relative frequency with which each module appears in the killmails, sorted in ascending order. Each curve is approximately linear in the middle but bends at the left and right extremes. The same shape appears in every region.

In [51]:
for ind in regions2.index.values:
    plt.figure(figsize=(15,6))

    # Module usage in this region, sorted from least to most used (red bars).
    bot15 = regions2.loc[ind].sort_values(ascending=True)
    x2 = bot15.index.values.tolist()
    x = pd.to_numeric(bot15.index.values).tolist()
    plot = sns.barplot(x = inv.loc[x, "typeName"], y=bot15.values, alpha=0.5, color='#FF0000')
    volums = []
    # No market volume overlay is drawn for region 10000002.
    if (ind != 10000002):
        volum = volumes.loc[ind]
        for i in x2:
            try:
                volums.append(volum[i])
            except (KeyError, IndexError):
                # No market volume recorded for this module in this region.
                volums.append(np.nan)
        # Market volume in the same module order (blue bars).
        plot2 = sns.barplot(x = inv.loc[x, "typeName"], y=volums, alpha=0.5, color='#0000FF')
        plot2.set_xticklabels([])
    plt.title(regi.loc[ind, "regionName"])
    plt.xlabel("Modules")
    plt.ylabel("Relative Frequency/Volume")
    plot.set_xticklabels([])
    plt.show()

The y axis represents either the proportion of market volume for that module in the region (blue bars) or the proportion of killmail appearances (red bars). The x axis lists the modules, sorted by usage, with the labels hidden for readability. When the modules are sorted by frequency in ascending order, the market volumes appear to be randomly scattered. In general there may be a very slight increase in relative volume from left to right, but it is much smaller than the increase in module frequency.

In addition, in some regions the blue graph extends far above the red graph, while in others the blue graph is almost completely contained within the red graph. This means some regions have higher average module usage, while some regions have higher average market volume. So the average usage of modules in a region does not necessarily depend on the average market volume in that region. This also refutes the hypothesis that there is a moderate correlation.

In conclusion, there is very little correlation between use of modules and their availability.
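One way to put a number on this visual impression would be a rank correlation between usage and market volume within a region. The sketch below is not part of the original analysis: the two series are invented values, and `rank_correlation` is a hypothetical helper that computes Spearman's coefficient as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical relative usage and relative market volume for five modules
# in one region (values invented for illustration only).
usage = pd.Series([0.05, 0.10, 0.20, 0.25, 0.40], index=[101, 102, 103, 104, 105])
volume = pd.Series([0.30, 0.10, 0.25, 0.15, 0.20], index=[101, 102, 103, 104, 105])


def rank_correlation(a: pd.Series, b: pd.Series) -> float:
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    # Align on the shared index and drop modules missing from either series.
    paired = pd.concat([a, b], axis=1, keys=["a", "b"]).dropna()
    return paired["a"].rank().corr(paired["b"].rank())


rho = rank_correlation(usage, volume)
```

A value of `rho` close to 0 would support the conclusion of little correlation, while values near 1 or -1 would contradict it. Applied to the real data, the same helper could be called once per region with that region's rows of `regions2` and `volumes`.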

Q2: Given the price of a ship, are modules of a similar price more commonly used than on average?

In [52]:
plt.figure(figsize=(30,20))
sns.scatterplot(x=np.log(fshipprice["ship_price"]), y=np.log(fshipprice["module_price"]), hue=fshipprice["group"],s =1)
plt.plot([10,22], [10,22], color="red")
plt.title("Graph of module prices against the price of the ship they are on")
plt.annotate("y=x", (10, 9.8))
plt.xlabel("Natural logarithm of ship price")
plt.ylabel("Natural logarithm of module price")
Out[52]:
Text(0, 0.5, 'Natural logarithm of module price')

From the graph, the densest areas form a slight upward trend, with the exception of the purple group, which is freighters. The orange group (standard ships) forms the bulk of the left side of the graph, but the prices of the modules fit on them do not increase much as the ship price increases. The blue points (precursor ships) start slightly higher than the green points (faction ships), but both gradually trend towards the same module prices.

The advanced ships (red) are very rarely seen, and occupy a rather small range of values on the x axis, so any trend is hard to see.

The standard ships occupy a greater range of x values, and the price of the modules generally increases as the price of the ship increases. However, the slope on the graph is very gradual, meaning the prices of modules do not increase as much as the prices of the ships they are on.

The precursor ships start slightly above the rest of the graph, meaning that on the cheaper precursor ships the modules are on average more expensive than on other ships of a similar price, while on the more expensive ships the module prices are, on average, about the same.

Faction ships always appear to be slightly below the y=x line, meaning the modules fit on them are on average cheaper than the ship itself.

Freighters are an exception: they are much more expensive than the modules usually fit on them. This could be because freighters are used to carry items and do almost no fighting. The ones appearing on the killmails could have been shot down by a fleet of players seeking the cargo in the freighter. Since these freighters see little combat, there would be no good reason to over-spend on modules that only marginally increase their power.

However, both axes of this graph are logarithmic, so even a slight deviation from the line corresponds to multiplying or dividing the price by a sizeable factor. Any attempt to predict module prices from ship prices would therefore be off by a factor of several.
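To make the size of these factors concrete, here is a small worked example. Because both axes use the natural logarithm, a vertical distance d from the y=x line corresponds to a module-to-ship price ratio of e^d. The helper `price_ratio` and the sample log-prices are hypothetical and chosen only for illustration.

```python
import math


def price_ratio(log_module_price: float, log_ship_price: float) -> float:
    """Ratio module_price / ship_price implied by two natural-log prices."""
    return math.exp(log_module_price - log_ship_price)


# A point one unit below the y = x line: the module costs 1/e of the ship,
# roughly 0.37 times the ship price.
ratio_one_below = price_ratio(15.0, 16.0)

# A point two units above the line: the module costs e^2 times the ship,
# roughly 7.4 times the ship price.
ratio_two_above = price_ratio(18.0, 16.0)
```

So a scatter that looks only a couple of units thick on the plot already spans price ratios differing by more than an order of magnitude.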

Q3: Is the popularity of modules different when fighting player and non-player enemies?

In [53]:
plt.figure(figsize=(200,100))
sns.barplot(x = inv["typeName"][pnp.index.values], y=pnp["P"])
plt.title("Usage rates of modules in PvP scenarios")
plt.xlabel("Modules")
plt.ylabel("Usage rate")
Out[53]:
Text(0, 0.5, 'Usage rate')
In [54]:
sns.kdeplot(data=pnp["P"])
plt.title("Density of modules over the PvP usage rate")
plt.xlabel("PvP usage rate")
Out[54]:
Text(0.5, 0, 'PvP usage rate')

From the graphs, we notice that the PvP usage rate of modules or ships is approximately normally distributed, but with a slight left skew and a slightly flatter top. The flatter top means that PvP usage rates slightly above or below 50% are fairly common, rather than being tightly concentrated at 50%.

The popularity of items is therefore, on average, different against player and non-player enemies. PvP and PvE fights are different enough that many items are preferred for one situation over the other.

So yes, the popularity of modules or ships does tend to be at least slightly different for PvP or PvE situations.
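The shape described above could also be checked numerically rather than by eye. The sketch below is an assumption about how one might do this, not the method used in this notebook: it applies pandas' `skew()` and `kurt()` to a small invented series of PvP usage rates, and the 0.15 threshold for "far from an even split" is arbitrary.

```python
import pandas as pd

# Hypothetical PvP usage rates for eight items (invented for illustration).
rates = pd.Series([0.1, 0.5, 0.6, 0.6, 0.7, 0.7, 0.7, 0.8])

# Negative skewness indicates a longer tail on the left side.
skewness = rates.skew()

# Excess kurtosis: negative values are commonly read as a flatter,
# lighter-tailed shape than a normal distribution.
excess_kurtosis = rates.kurt()

# Share of items whose rate differs from an even 50/50 split by more
# than an arbitrary threshold of 0.15.
share_far_from_even = ((rates - 0.5).abs() > 0.15).mean()
```

On the real data, the same three quantities could be computed from `pnp["P"]` to back up the reading of the density plot.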

Additionally, at the top end of the distribution, there are items that are used only in PvP situations.

In [55]:
plt.figure(figsize = (20,10))
plot = sns.barplot(x = inv["typeName"][pnp[pnp["P"] == 1].index.values], y=pnp[pnp["P"] == 1]["P"])
plot.set_xticklabels(plot.get_xticklabels(), rotation=30, ha="right")
plt.title("Modules or ships exclusively used in PvP")
plt.xlabel("Modules")
plt.ylabel("PvP Usage Rate")
plt.show()

Judging from the names, these are structures and capital ships, along with the modules that are fit on them. This is likely because they are massive and impractical for raiding the generally more secluded bases of pirates and other non-player enemies. Additionally, structures are rooted in place, so they cannot be used to attack non-player enemies, which do not move out from their bases.

Q4: Given a certain module, which modules are most commonly fit alongside it?

In [56]:
normalized = (grouppair2 / grouppair2.sum()).T
normalized.columns = marketgroups["marketGroupName"][pd.to_numeric(normalized.columns.values).tolist()]
normalized.index = marketgroups["marketGroupName"][pd.to_numeric(normalized.index.values).tolist()]
normalized.index.name = "row"
normalized.columns.name='column'
fig = px.imshow(np.log(np.log(normalized+1)+1), text_auto = True, width=600, height=500)
fig.update_layout(title_text = "Which module groups are commonly paired together?", yaxis_title = "", xaxis_title = "").update_yaxes(automargin=False).update_xaxes(automargin=False)
fig.show()
#plt.show()

The trend seems to be that modules are paired with their rigs: for example, scanning rigs are paired with scanner modules, harvesting equipment is paired with resource processing rigs, and electronic warfare modules are paired with their rigs. This makes sense, as rigs give bonuses to that specific type of item. In addition, items of similar types also seem to be paired with each other, such as structure engineering rigs with one another, electronic warfare with energy neutralizers, and scanning equipment with other scanning equipment. The brighter points on the heatmap are the pairs of module groups that occur most often relative to other pairs.
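The same question can be answered without zooming into the heatmap by reading the largest entries of a row directly. The following is a minimal sketch under stated assumptions: `pairing` is an invented row-normalized matrix with illustrative group names, and `top_partners` is a hypothetical helper, not a function from this notebook.

```python
import pandas as pd

# Hypothetical row-normalized pairing matrix: each row sums to 1.
groups = ["Scanning Equipment", "Scanning Rigs", "Energy Neutralizers"]
pairing = pd.DataFrame(
    [[0.1, 0.6, 0.3],
     [0.5, 0.2, 0.3],
     [0.2, 0.1, 0.7]],
    index=groups,
    columns=groups,
)


def top_partners(matrix: pd.DataFrame, group: str, k: int = 2) -> pd.Series:
    """Return the k groups most often fit alongside `group`, excluding itself."""
    row = matrix.loc[group].drop(group, errors="ignore")
    return row.nlargest(k)


partners = top_partners(pairing, "Scanning Equipment")
```

With the real data, `matrix` would be the `normalized` frame built above, provided its row and column labels are unique.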

Recommendations or Further Works

State any recommendations, improvements or further works.
  • With more market data, it may be beneficial in the future to analyze in more depth how the market affects the usage of modules.
  • With more time I could better sort out the item groups to filter the more useful groups.
  • I could also make a dedicated application to access the usage rate of modules instead of just plotly and zooming. This would be helpful in the game itself.

References

Cite any references made, and links where you obtained the data. You may wish to read about how to use markdown in Jupyter notebook to make your report easier to read. https://www.ibm.com/docs/en/db2-event-store/2.0.0?topic=notebooks-markdown-jupyter-cheatsheet

All the links have been stated under the dataset. The only other reference is EVE Online itself.

Appendix

There are many additional notebooks that helped in the data processing and downloading, as well as early stage EDA.

  • data, data2, data3: Used in downloading of killmails
  • massivefile: used for combining and sampling killmails and general file processing
  • market1 - market18: Run in parallel to download historical market data
  • marketcombine: To process market data and combine them
  • pairings1 - pairings5: Used to process the Q4 data
  • ship_price and ship_price2: used for Q2
  • pve_or_pvp and pnp_eda: used for Q3
  • region, regions and region_eda: used for Q1

The rest of the raw data is available here: https://drive.google.com/file/d/14tmkJL2JIrz3QHsyUNz2nCEzGWqODzwu/view?usp=sharing
